Transposable element-mediated rearrangements are prevalent in human genomes

Transposable elements constitute about half of human genomes, and their role in generating human variation through retrotransposition is broadly studied and appreciated. Structural variants mediated by transposons, which we call transposable element-mediated rearrangements (TEMRs), are less well studied, and the mechanisms leading to their formation as well as their broader impact on human diversity are poorly understood. Here, we identify 493 unique TEMRs across the genomes of three individuals. While homology directed repair is the dominant driver of TEMRs, our sequence-resolved TEMR resource allows us to identify complex inversion breakpoints, triplications or other high copy number polymorphisms, and additional complexities. TEMRs are enriched in genic loci and can create potentially important risk alleles such as a deletion in TRIM65, a known cancer biomarker and therapeutic target. These findings expand our understanding of this important class of structural variation, the mechanisms responsible for their formation, and establish them as an important driver of human diversity.

2. For TE divergence, is it possible to relate this to overall degrees of TE divergence for different types of TEs in the genome? Namely, some context for the degrees of TE divergence could enhance the study.

Reviewer #1 (Remarks to the Author):
This manuscript reports a very thorough analysis of transposable element-mediated rearrangements (TEMRs) based on both short-read and PacBio long-read whole-genome sequencing of three individuals (each of different genetic ancestry). The methods are appropriate, and the results are clearly described. I have just a few questions and comments.
We thank the Reviewer for their appreciation of our manuscript and the appropriateness of the methods and data interpretation.
Line 151: The term "non-TEMR" is not defined, though I would assume it refers to rearrangements that are not mediated by TEs. This should be clarified.
To clarify this point, we have amended the text where TEMRs are first introduced to read "SVs with both breakpoints in different TEs of the same element class were categorized as TEMRs (Methods). In contrast, SVs with zero or one breakpoint within a TE, or with both breakpoints within different types of TEs were classified as non-TEMR events."
Line 164: It makes sense that TEMRs mediated by full-length LINE-1 elements are larger than those mediated by Alus because of size differences in the two types of TEs. Most LINE-1 elements, however, are truncated to 1 kb or so; thus, it would be predicted that TEMRs mediated by truncated LINE-1s should be shorter than those mediated by full-length LINE-1s, but longer than those mediated by Alus. Were truncated LINE-1s examined, and is this the case?
We thank the Reviewer for this insightful question; we did indeed consider both full-length and truncated LINE-1s for this study. We performed the requested analysis, which had the expected result, and have updated the text to reflect the data: "We found that Alu TEMRs (median length of 1,163 bp) are typically shorter than LINE-1 TEMRs 32 (median length of 4,469 bp; p < 1e-5, Welch's t-test); this includes both full-length (7,663 bp; p < 1e-5, Welch's t-test) and truncated LINE-1 elements (median length of 3,618 bp; p < 1e-4, Welch's t-test) (Fig. 1c)."
Line 211: It's interesting that the great majority of TEMRs that involved homology-directed repair are Alu-mediated (90%), while the majority that involve non-homologous end joining (60%) are mediated by LINE-1s. Given the sample size, this difference is almost certain to be statistically significant. The authors should discuss some reasons why this difference is seen.
The difference between Alu and LINE-1 for NHE events is indeed statistically significant (p < 1e-23, two-tailed Fisher's exact test). We have added this point in the results: "We found that 89.2% of TEMR-HRs were driven by Alu elements and 62.5% of TEMR-NHEs were driven by LINE-1 elements." Furthermore, we expect that given the relative number of templates for homologous repair, most of the breaks that occur within an Alu will be repaired by homology-mediated processes between the element with a break and a nearby Alu. However, although Alu elements have far more neighboring homologous substrates, they comprise only ~1/2 of the sequence content of the human genome that LINE-1 sequences do. Therefore, the likelihood of getting a random break in two LINE-1 elements followed by NHEJ is much higher than for Alu sequences. We have added this point to the discussion and the sentence now reads "Furthermore, given the relative number of templates for homologous repair, most of the breaks that occur within an Alu element will likely be repaired with recombination with a nearby Alu element. Although Alu elements have far more homologous substrates, they comprise only half of the sequence content of the human genome compared to LINE-1 elements. Therefore, the likelihood of getting a random break in two LINE-1 elements followed by non-homologous repair is much higher than this occurring between Alu elements."
Three types of TEs are active in humans: Alu, LINE-1, and SVA. Although SVAs are much less prevalent than the other two elements, it would be interesting to know whether they were also considered (or the authors could explain why they were not considered).
We did consider other types of TEs when identifying TEMRs; however, due to the low number of these events and difficulties aligning them to a consensus sequence, we initially decided to exclude them. We have now updated the results and added a supplementary table (Supplementary Table 1). We also updated Fig. 1a with an additional category called "Other TEs" which contains non-Alu or LINE-1 TEMRs.
We manually inspected the two SVA-driven TEMRs and found them to be HR driven. Additionally, we inspected the other 48 TEMRs (those driven by TEs other than Alu, LINE-1 and SVA) and found the median homology length at the breakpoint junction to be 4 bp and median size to be 795 bp. Due to the difficulties that preclude extensive mechanistic work with these classes of TEMRs, we added them to Supplemental Table 1, but did not investigate the mechanisms or consequences of these events.
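For readers who wish to run the kind of two-tailed Fisher's exact test cited in the response on Alu versus LINE-1 mechanisms above, the following is a minimal sketch in plain Python. The function name and the 2x2 counts are hypothetical placeholders and are not the counts from the study; rows could stand for mechanism (HR, NHE) and columns for TE family (Alu, LINE-1).

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum every table that is no more probable than the observed one.
    # The small tolerance guards against floating-point ties.
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Hypothetical toy counts, NOT taken from the manuscript.
toy_p = fisher_exact_two_tailed(8, 2, 1, 5)
```

With real data, the four arguments would be the observed counts of HR/Alu, HR/LINE-1, NHE/Alu and NHE/LINE-1 events; a statistics library would normally be used instead of this hand-written version.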

Reviewer #2 (Remarks to the Author):
Interesting and well-conducted research shows the key role of transposable elements (TE) in mediating genomic rearrangements. By analysing the three human genomes, 493 TE-mediated rearrangements (TEMRs) were identified using both long read and short read sequencing. For 70 randomly selected TEMRs, the precise junctions have been ascertained using PCR and Sanger sequencing. For all rearrangements, their features (size, homology, orientation, GC content, TE density, etc.) were investigated and compared between different groups by means of statistical tests (mostly Welch's t-test). Moreover, the mechanism behind TEMR formation has been deciphered for most cases.
The methodology for TEMRs detection and statistical analysis is sound and robust. Several interesting findings are presented and hypotheses about the impact of TEMRs on evolutionary processes are stated. The paper is well written and the presentation is clear. The study is of interest for a wide audience and is worth publishing, but the following additional analyses would greatly improve the impact and clarity of the results.
We thank the Reviewer for their kind words about our manuscript, insightful evaluation of our analysis and findings, and recognition of the important role of transposable elements in the formation of structural variants.
major: (i) The number of statistical tests performed revealed some rules that guided the formation of TEMRs (like longer TE mediate larger rearrangements). It would be great to propose a more comprehensive model explaining these phenomena with the use of nominal logistic regression or Poisson regression.
We appreciate the potential benefit of applying regression models to our callset. We interrogated seven features (percent similarity, homology length, TEMR size, 5′ TE size, 3′ TE size, GC percentage of 5′ TE, and GC percentage of 3′ TE) with a logistic regression model, using the sklearn package in Python with KFold cross-validation (k=10).
We obtained the estimated coefficients for the features used with the following scores: accuracy of 99% on the training dataset, 94% on the test dataset, precision of 96%, recall of 96.5%, and F1 score of 92.6%. Certain criteria for the features used in this study were clearly linked with the mechanisms because of prior knowledge, such as longer homology in HR-driven events and TEs in indirect orientation driving NHE events. We believe that, with the features used in this study and the given sample size, we are unable to fit a regression model that explains the different mechanisms.
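To make the cross-validation procedure described above concrete, here is a minimal sketch of 10-fold cross-validated logistic regression. It is a dependency-free stand-in written in plain Python, not the sklearn code used for the response, and every name in it (`k_fold_indices`, `fit_logistic`, `make_toy_data`, and so on) is hypothetical. The toy data has two pre-standardized features and a binary label (1 = HR, 0 = NHE) and is synthetic, not drawn from the callset.

```python
import math
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and split it into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit weights (bias stored last) by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + w[-1]) - yi
            for j in range(d):
                grad[j] += err * xi[j]
            grad[-1] += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def predict(w, x):
    """Return 1 when the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1]) >= 0.5 else 0


def cross_validate(X, y, k=10):
    """Return the mean held-out accuracy over k folds."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        w = fit_logistic([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(w, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / k


def make_toy_data(n=100, seed=1):
    """Synthetic, well-separated two-feature data (an assumption for illustration)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        centre = 1.0 if label else -1.0
        X.append([rng.gauss(centre, 0.3), rng.gauss(centre, 0.3)])
        y.append(label)
    return X, y


X_toy, y_toy = make_toy_data()
mean_accuracy = cross_validate(X_toy, y_toy, k=10)
```

Each sample is held out exactly once, so the mean accuracy reflects performance on data the model did not see during fitting. Precision, recall and F1 could be accumulated per fold in the same loop.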

Additionally, we performed correlation analysis between the features used to study TEMRs in this manuscript and the mechanisms of rearrangement. We added the following paragraph to the Results: "We grouped TEMRs based on their mechanism (HR / NHE), family (Alu / LINE-1) and orientation of the TE involved (Direct / Indirect) and performed correlation analysis among three main characteristics used to discern TEMR mechanisms: the length of the TEMR, the tract length of homology at the breakpoint junction, and the similarity between the two TEs involved in the rearrangement (Supplementary Table 7
(ii) Besides the contribution of TEMRs to variation in functional regions, the impact on position effects for rearrangements spanning non-coding regions should be quantified by the analysis of chromatin conformation and the interactions between gene promoters and their regulatory elements.
Since TADs are well conserved within and across species, we decided to utilize GM12878, a well characterized lymphoblastoid genome, for this analysis. We intersected 493 TEMRs with TADs identified in GM12878 (PMID: 25497547) and updated the manuscript accordingly: "Further, we intersected 493 TEMRs with topologically associating domains (TADs) identified in GM12878 67 and found that 459 (83.1%) TEMRs were present completely within TADs and 1 TEMR was present at the edge of a TAD"
(iii, optional) All TEMRs detected in the study of three genomes are mediated by Alu and LINE-1 elements, which are ubiquitous in the human DNA. The interesting open question is whether other kinds of transposable elements also mediate rearrangements, not necessarily by non-allelic homologous recombination mechanisms. Such targeted analysis should cover more genomes, but finding a rearrangement should be feasible if we focus on specific elements (e.g. HERVs) with well-defined parameters.
In fact, we did consider other types of TEs when identifying TEMRs; however, due to the smaller sample size and difficulties in aligning them to consensus sequences, we decided to exclude them. We do agree that with a larger TEMR sample size we could potentially uncover other TEs driving rearrangements, but this will require analysis of additional genomes and extensive manual curation (with current techniques).
Based on this question we have updated the results and added a supplementary table (Supplementary Table 1) with the size information of each additional TEMR type. The manuscript now includes the following: "From our high-confidence callset of 5,297 SVs, we identified 543 nonredundant TEMRs (10.25%) across all three individuals (Fig. 1a). Table 1). Due to the prevalence of LINE-1 and Alu-mediated events, the difficulties in aligning ERVs and divergent transposons to consensus sequences, and the small number of TEMRs driven by non-Alu and LINE-1 categories precluding extensive mechanistic work, we focused on the two primary categories of TEMR in this study." We also updated Fig. 1a with an additional category for "Other TEs" which comprises non-Alu or LINE-1 TEMRs.
We manually inspected the two SVA-driven TEMRs and found them to be HR driven. We also inspected the 48 Other TEMRs (those driven by TEs other than Alu, LINE-1 and SVA) and found the median homology length at the breakpoint junction to be 4 bp and median size to be 795 bp. Due to the difficulties that preclude extensive mechanistic work with these classes of TEMRs, we added them to Supplemental Table 1, but did not investigate the mechanisms or consequences of these events. Additionally, these three individuals have been studied as a part of the 1000GP phase 3, HGSVC phase 1, and HGSVC phase 2 studies, which provides us with extensive genomic data ranging from short read sequencing to long read sequencing, Bionano Genomics data and RNA-Seq that can be used for further understanding of TEMRs.
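The TAD intersection described in the response to comment (ii) above amounts to an interval containment check. The sketch below is one hedged way to express it in plain Python; the function name, the category labels ("within", "boundary", "outside") and all coordinates are assumptions made for illustration, and the code handles a single chromosome with sorted, non-overlapping TADs given as half-open (start, end) intervals.

```python
from bisect import bisect_right


def classify_against_tads(sv, tads):
    """Classify one SV interval against sorted, non-overlapping TADs.

    Returns "within" if the SV lies entirely inside one TAD, "boundary" if it
    overlaps a TAD without being contained in it, and "outside" otherwise.
    """
    start, end = sv
    starts = [tad_start for tad_start, _ in tads]
    # Index of the last TAD that starts at or before the SV start.
    i = bisect_right(starts, start) - 1
    if i >= 0 and end <= tads[i][1]:
        return "within"
    # A partial overlap means the SV crosses the edge of at least one TAD.
    overlaps = any(start < tad_end and tad_start < end for tad_start, tad_end in tads)
    return "boundary" if overlaps else "outside"


# Hypothetical coordinates, NOT taken from GM12878 or the callset.
toy_tads = [(0, 100), (150, 300)]
toy_svs = [(10, 50), (90, 160), (110, 140)]
toy_labels = [classify_against_tads(sv, toy_tads) for sv in toy_svs]
```

For a genome-wide analysis the same check would be applied per chromosome, typically with an interval tool rather than a hand-written loop.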

Reviewer #3 (Remarks to the Author):
This is a review of NCOMMS-22-16442-T, "Transposable element-mediated rearrangements are prevalent in human genomes." Herein, the authors use computational methods with deep sequencing data to identify structural variants (SVs) that have transposable elements (TEs) at the junctions in three human individuals, and then perform a series of analyses on these SVs, including some validation experiments with PCR/Sanger sequencing, and describing features of the SVs, such as homology length, degrees of sequence divergence, and sizes of the SVs. Among the 493 TE-associated SVs, most of them were deletions, but some were more complex, and an analysis of these complex events is shown, along with analysis of the propensity for these rearrangements in gene bodies. I found this analysis quite compelling, and its impact in part will be as a guide for mechanistic studies to determine the DNA repair pathways that mediate these events.
We thank the Reviewer for their positive assessment of the findings and methodologies used in our manuscript.
Major comments: 1. I suggest that the impact of the study could be enhanced by analyzing whether certain features of TE-mediated SVs show correlations, or not. Namely, perform correlation analysis with these parameters they describe: homology / non-homology and also homology length, vs. SV size, and vs. TE divergence. Does homology length, or % homology vs. non-homology correlate with SV size? With TE divergence? Does SV size correlate (positively or negatively) with TE divergence? Since the mechanism of repeat-mediated SVs can be affected by repeat distance and divergence (e.g. PMID: 32023454), such analysis would be very useful to the field. Of course, the lack of a correlation might be due to small sample size, but that caveat could be included in the text.
We thank the Reviewer for this suggestion and have incorporated the correlation analysis among the features used to study TEMRs in this manuscript. We added the following paragraph to the Results: "We grouped TEMRs based on their mechanism (HR / NHE), family (Alu / LINE-1) and orientation of the TE involved (Direct / Indirect) and performed correlation analysis among three main characteristics used to discern TEMR mechanisms: the length of the TEMR, the tract length of homology at the breakpoint junction, and the similarity between the two TEs involved in the rearrangement (Supplementary Table 7